ABSTRACT



A computer filing system includes a data access and allo- 
cation mechanism including a directory and a plurality of 
indexed data files or hash tables. The directory is preferably 
a radix tree including directory entries which contain point- 
ers to respective ones of the hash tables. Using a plurality of 
hash tables avoids the whole database ever having to be 
re-hashed all at once. If a hash table exceeds a preset 
maximum size as data is added, it is replaced by two hash 
tables and the directory is updated to include two separate 
directory entries each containing a pointer to one of the new 
hash tables. The directory is locally extensible such that new 
levels are added to the directory only where necessary to 
distinguish between the hash tables. Local extensibility 
prevents unnecessary expansion of the size of the directory 
while also allowing the size of the hash tables to be 
controlled. This allows optimisation of the data access 
mechanism such that an optimal combination of directory- 
look-up and hashing processes is used. Additionally, if the 
number of keys mapped to an indexed data file is less than 
a threshold number (corresponding to the number of entries 
which can be held in a reasonable index), the index for the 
data file is built with a one-to-one relationship between keys 
and index entries such that each index entry identifies a data 
block holding data for only one key. This avoids the over- 
head of the collision detection of hashing when it ceases to 
be useful. 

12 Claims, 6 Drawing Sheets

INDEXED FILE SYSTEM AND A METHOD
AND A MECHANISM FOR ACCESSING
DATA RECORDS FROM SUCH A SYSTEM

FIELD OF INVENTION

The present invention relates to accessing data records
from a database, and in particular to accessing data from a
dynamic database to which data elements can be added and
from which data elements can be deleted such that the data
volume (and not just data values) can change over time.

BACKGROUND OF THE INVENTION

With any database, it is necessary to have a file structure
and access method which enables efficient access to data
records. Hashing techniques which access records by
transforming keys to addresses are well known in the art. A hash
table comprises an indexed data file which is held in storage
with the index being used to associate data keys with
addresses of particular data storage blocks or "bins" within
the file. An input hash key is transformed by a hashing
algorithm (which may be a simple numerical division or a
more complex transformation) and is then compared with
the index of the hash table held in system memory to obtain
the address of the relevant bin within a data file which is held
in a peripheral storage device. A determination is then made
of which data elements in this data file bin are the relevant
ones for the key. This determination is known as
"disambiguation" or "collision handling", and may be as simple as
comparing the hashed key with the stored keys of elements
in the file.

There are problems with pure hashing techniques in
relation to large databases. If bins are too large (i.e. there are
too many data elements in each separate data file bin) then
the collision handling part of the data access process is too
slow. If bins are too small then the hash table itself is too
large and takes up too much disk space. For a very large
database, conventional hashing tables will take up too much
disk space for any reasonable bin size and may be too large
to transfer their index into memory. However, hashing is still
a very effective access technique for smaller databases.

Hashing was originally used with static data structures
('static' in the sense that the extent and structure of the data
remain unchanged during processing and only data values
are updated). The first adaptations of such static structures to
allow for insertions and deletions required deletion-flags and
pointers to 'overflow bins' which were separate from the
main data structure. Frequent expensive restructuring of the
data structure was required (typically when the number of
holes left by deletions, and overflow areas created by
insertions, became sufficient to degrade performance
significantly). Adaptations of hashing techniques for use
with dynamic databases required costly 'rehashing'
whenever a bin within the hash table became over-full (i.e. when
the keys pointing to an individual disk file have too many
data items associated with them). Rehashing involves the
choice of a new fixed size for the hash table and bins within
the table, a new hash function, and relocation of all records
within the table. Opting for a hash table size and bin size that
uses a high estimate of the number of records to be placed
therein would minimize rehashing frequency but would also
result in valuable disk space being wasted. Underestimating
hash file storage requirements results in a large number of
overflow records (slowing down searching and updating)
and frequent rehashing. More efficient and adaptable file
organisations and access techniques are required for
dynamic databases.

An access method dubbed "extendible hashing" was
described by R. Fagin et al in "Extendible Hashing—A Fast
Access Method for Dynamic Files", ACM Transactions on
Database Systems, Vol.4, No.3, September 1979, pages
315-344. This paper describes a particular adaptation of
hashing which makes hash tables extendible by separating
the hash address space from a directory address space. The
hash table, which includes a directory with each entry
pointing to a disk file or bin within the table, is extendible
since additional bins are added if a bin overflows and the
directory is extended to use an additional bit of information
from each input hash key to distinguish between the
increased number of bins. The most significant bits of the
[illegible] address space and
the number of significant bits used is increased if any bin
overflows. This extension is done without rehashing the
[illegible] unchanged. There is a
[illegible]
each time a bin overflows, and so the hashing table which
exists to enable efficient access to data elements may itself
take up too much of the available disk space. This is a
particular problem for very large databases to which data
records are added randomly and for which the hash table is
too large to be held in memory. Furthermore, certain
operating systems impose a restriction on the maximum size of
a hash table such that they are not extensible beyond that
maximum.

The Fagin et al paper also describes the same solution
from a different perspective, referred to as a "balancing" of
radix search trees. Radix search trees examine an input key
[illegible] letter at a time, and the search focuses on a
particular branch of the tree at each step through the tree
hierarchy. Radix trees can provide faster access than search
trees which compare whole keys, but they typically use more
[illegible] 'flattened' directory
[illegible] which seeks to maximize access speed by
flattening the directory structure such that a single pointer from the
directory accesses a required file. The depth of the directory
(how many levels are flattened into one, and so how many
significant bits of a key are used in a search) can be varied
as necessary to guarantee access to a required file in a single
probe. Each time a "leaf" (a page of memory at the bottom
of the directory hierarchy) overflows, requiring a new
directory level, the directory is doubled in size to extend it while
keeping a flattened structure. The radix tree has thus
degenerated into a one-step access mechanism for maximum
speed, but at considerable cost in memory unless the keys
are very uniformly spread over the key space.

P. A. Larson's paper "Dynamic Hashing", BIT vol.18,
No.2 (1978), pages 184-201, describes a further file
organisation based on hashing in which the allocated storage space
can be increased and decreased without rehashing the whole
file. This is achieved by providing an index to a data file
which index is organised as a series of binary hash trees,
nodes of the index including pointers to particular buckets
within the data file. The problem which arises with this file
organisation is that the data file can become difficult to
manage as a contiguous file as data is added and the file
grows, and so this solution is not well suited to large
databases.

SUMMARY OF INVENTION

In a first aspect, the present invention provides a data
access mechanism for a computer filing system, the access
mechanism including:

a directory having a plurality of directory entries at least
some of which provide pointers to respective indexed data
files;
logic for analysing a data key input to said filing system,
by comparison of the data key with entries in the directory,
to identify one of said plurality of directory entries providing
a pointer to a respective indexed data file;

a plurality of indexed data files for storing data; and

a plurality of indices, each corresponding to one of said
indexed data files;

wherein each of said indexed data files is locatable via one
of said pointers, wherein the index for each indexed data file
contains identifiers and storage addresses of data blocks
within the indexed data file, each indexed data file having
associated logic for transforming an input data key to
generate a transformed key and logic for comparing a
transformed key with data block identifiers in the index to
identify a match and to obtain the address of an identified
data block within the indexed data file, and each indexed
data file having associated logic for identifying from records
within an identified data block one or more data records
related to the input data key.

The directory is preferably an hierarchical directory such
as a radix tree directory, for iteratively analysing the input
data key to identify the respective pointer, one iteration
being performed by a comparator at each successive node of
the hierarchy to select a path through the directory from that
node towards a "leaf node" comprising the lowest level of
the hierarchy for the respective branch of the hierarchy, each
leaf node providing a pointer to a respective hash table, and
each iteration of the analysis using one or more additional
bits of information from the input key (i.e. additional to the
information used by the previous node) to distinguish
between keys until a leaf node is reached. As noted
previously, a hash table is an indexed data file having an
associated algorithm for transforming an input data key to a
data string that can be compared with the index to identify
a data block within the data file, and having a process for
identifying data records relevant to the input key from the
records within the data block.

The directory according to the invention is preferably
locally extensible such that the number of bits of an input
key used to identify a hash table can be varied over time and
according to the amount of data associated with different
areas of the key space. Since leaf nodes of the directory can
point to different hash tables, the number of hash tables
allocated can also be varied such that each hash table
remains at a manageable size.

The combination of an extensible directory and a variable
number of hash tables permits efficient access to data records
within a very large database as well as efficient access when
the database is small. A pure directory-look-up approach for
a very large database would require a directory which takes
up too much storage space and, since the directory would be
unlikely to fit within memory, the number of disk accesses
required to step through the directory may result in poor
access performance. A pure directory approach for a small
database would be wasteful of memory space. The
alternative of a pure hashing approach is likely to be slow for very
large databases since the need to control the size of the hash
table as it is represented within memory will result in very
large data bins within the table as stored in secondary
storage such that collision handling will be slow. By using
a number of separate hash tables, it is possible to avoid ever
having to rehash the entire database all at once and the size
of each hash table can be managed effectively. There is no
requirement for the separate hash tables to be stored together
in a contiguous block of storage.

The invention according to a preferred embodiment thus
uses a locally extensible directory (which may be a radix tree
directory) providing pointers to a variable number of
indexed data files to access data within a dynamic file
system. The invention according to this embodiment enables
the proportion of the access processing which is performed
by the directory and the proportion performed by hashing to
be varied as the database grows (or shrinks) such that the
access method remains optimal for the current database
characteristics. If certain input data keys have a lot of data
associated with them, then the data file for that area of the
key space will be implemented as a hash table which only
contains data for a relatively small number of different keys
and there will be relatively little collision (i.e. a large
proportion of the data in a bin will be associated with a small
number of keys or the same key). If a data file has a
sufficiently small number of keys associated with its data
then [illegible] index may point to a data file bin that
contains data for only this key and then the collision
[illegible] superfluous for data within
that bin.

The present invention according to a preferred
embodiment takes advantage of this characteristic. When a data file
[illegible] number of keys to be mapped to each
replacement disk file will be no greater than the number of
bins within each file, the data allocation mechanism of the
present invention organizes the replacement disk files such
that each key is mapped to a separate bin, and then no
collision detection is needed for these disk files. The cost of
the collision detection step of hashing has thus been avoided
for any disk files for which the relevant key space can be
represented by an index which maps one key to one data bin.
This avoidance of collision detection for areas of the key
space having lots of associated data provides for high
performance when the database is large, and local
extensibility of the directory maintains optimal performance when
usage of the key space is uneven (that is, high performance
is possible if the disk space and memory space of a large
system are also available). If the system is small then the
economy of resources of hashing is desirable and the cost of
collision detection is acceptable because there is not much
data.

A data access mechanism and a filing system according to
a preferred embodiment of the invention has a predefined
maximum hash table capacity, the hash tables each being
extensible up to said predefined maximum. For example,
when a data "bin" (a block of storage) within a hash table
becomes full, that bin may be moved to a new storage
location corresponding to a larger storage area such that the
total capacity of the hash table is increased. This is preferred
to the alternative of replacing a bin with two separate storage
blocks since keeping each bin as a contiguous storage block
enables data access via a single disk access.

In response to a first hash table exceeding the maximum
capacity, the data access mechanism is adapted to replace the
first hash table with a plurality of separate hash tables and to
extend the directory by addition of new nodes located on
branches of the tree extending from the node of the directory
which previously identified the pointer to the first hash table.
The additional nodes form a new lowest level of the
directory hierarchy for the respective branch; the previous "leaf
node" becomes a branching node of the directory for
selecting between the new "leaf nodes". The extended directory
then uses one or more additional bits of information from
input data keys (i.e. the next significant bits of the sequence
of bits forming the key) to distinguish between the keys to
identify a leaf node containing a pointer to a respective one
of said two or more hash tables.

This avoids the problem of the Fagin et al prior art of
repeated doubling of a directory each time a data file
overflows, since the leaf nodes are not required to all be at
the same level in the hierarchy and only the required parts
of the directory are extended. Since the directory of the
present invention identifies a pointer to one of a number of
hash tables, rather than to specific data blocks within a hash
table, the directory only needs to be extended when a hash
table exceeds its maximum capacity and is split rather than
being extended whenever splitting of a data block within a
hash table increases the depth of the hash table as in Fagin
et al. Similarly, the directory is more easily shrinkable when
deletion of data from the database leaves under-used data
bins than would be a directory according to Fagin et al. Hash
tables as implemented within the preferred embodiment of
the invention are extensible by expansion of data blocks
(allocating a larger area of storage for the bin and
transferring the data to the new storage area), but this only requires
updating of the header block within the respective hash table
and not updating of the directory (unless the hash table has
reached its maximum capacity).
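
The local extension of the directory discussed above can be illustrated with a minimal Python sketch. It is not taken from the specification: the class names, the 8 bit key width, the branching on one bit per level, the capacity of four keys per table and the use of an in-memory dict in place of a hash table file are all assumptions made for illustration.

```python
KEY_BITS = 8   # assumed key width for the example
MAX_KEYS = 4   # assumed maximum capacity of one hash table (in keys)


class Leaf:
    """Leaf node: holds the pointer to one hash table (here a dict)."""
    def __init__(self):
        self.table = {}  # stand-in for an indexed data file


class Branch:
    """Branching node: selects a child using one further key bit."""
    def __init__(self, zero, one):
        self.children = [zero, one]


class Directory:
    def __init__(self):
        self.root = Leaf()  # smallest database: a single hash table

    def _bit(self, key, depth):
        # depth 0 examines the most significant bit of the key
        return (key >> (KEY_BITS - 1 - depth)) & 1

    def lookup(self, key):
        node, depth = self.root, 0
        while isinstance(node, Branch):
            node = node.children[self._bit(key, depth)]
            depth += 1
        return node.table.get(key, [])

    def insert(self, key, record):
        parent, slot, node, depth = None, None, self.root, 0
        while isinstance(node, Branch):
            parent, slot = node, self._bit(key, depth)
            node = node.children[slot]
            depth += 1
        node.table.setdefault(key, []).append(record)
        if len(node.table) > MAX_KEYS and depth < KEY_BITS:
            self._split(parent, slot, node, depth)

    def _split(self, parent, slot, leaf, depth):
        # Replace one over-full table by two and extend only this branch.
        # A skewed split is simply split again on a later insertion.
        zero, one = Leaf(), Leaf()
        for key, records in leaf.table.items():
            target = one if self._bit(key, depth) else zero
            target.table[key] = records
        branch = Branch(zero, one)
        if parent is None:
            self.root = branch
        else:
            parent.children[slot] = branch

    def leaf_depths(self, node=None, depth=0):
        node = self.root if node is None else node
        if isinstance(node, Leaf):
            return [depth]
        return (self.leaf_depths(node.children[0], depth + 1)
                + self.leaf_depths(node.children[1], depth + 1))
```

In this sketch, inserting a cluster of nearby keys deepens only the branch that holds them, while a leaf holding a sparse region of the key space keeps its original depth; no other part of the directory is touched by a split.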

In a second aspect, the invention provides a filing system
for a computer including: 

a plurality of indexed hash tables for storing data, each 
table being identifiable by a respective one of a plurality of 
pointers; and 

a directory for analysing input data to identify a respective 
one of the plurality of pointers, thereby to select one of the 
plurality of hash tables for storing the input data. 

According to a preferred embodiment, the invention
provides an indexed file system including a data processing
system having a processor, memory, peripheral storage, and
communication buses for transferring data between said
processor, memory and peripheral storage, said indexed file
system also including a data allocation and access
mechanism comprising:

a directory, stored in said memory, including a plurality of 
directory entries at least some of which provide pointers to 
respective indexed data files; 

logic, stored in said memory, for use by said processor to 
analyse a data key input to said filing system, by comparison 
of the data key with entries in the directory, to identify one 
of said plurality of directory entries providing a pointer to a 
respective indexed data file; 

a plurality of indexed data files for storing data in said 
peripheral storage; and 

a plurality of indices, each corresponding to one of said 
indexed data files; 

wherein each of said indexed data files is locatable via one
of said pointers, wherein the index for each indexed data file
contains identifiers and storage addresses of data blocks
within the indexed data file, each indexed data file having
associated logic for use by said processor to transform an
input data key to generate a transformed key and logic for
use by said processor to compare a transformed key with
data block identifiers in the index to identify a match and to
obtain the address of an identified data block within the
indexed data file, and each indexed data file having
associated logic for use by said processor to identify from records
within an identified data block one or more data records
related to the input data key.

The data access mechanism according to the invention is
suitable for implementation in an indexed file system storing
biometric data, where a plurality of input keys represent
biometric data for an individual, such as a scanned
fingerprint. A plurality of input keys (possibly hundreds or even
thousands of data elements making up the biometric data for
the individual) may be analysed within a parallel processing
system which includes the data access mechanism of the
invention, including comparing each key with a large
database of biometric records (for example, a database holding
thousands of data elements for each of several million
individuals) to identify matches between each of the keys
and stored records. An individual person can be identified or
verified if the result of this keyed access to the large database
is a significant number of matches between the input
biometric data keys and stored biometric data records for that
individual.
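
The identification step described above amounts to counting, for each individual, how many of the input keys retrieve a stored record for that individual. A minimal Python sketch follows. The dict standing in for the keyed database, the function name `identify` and the `min_matches` threshold are illustrative assumptions; the specification does not state what counts as a significant number of matches.

```python
from collections import Counter


def identify(input_keys, database, min_matches):
    """Return the person identifiers whose stored records match at
    least min_matches of the input keys.

    database maps a key to the list of person identifiers stored
    under that key (a stand-in for the keyed file access).
    """
    votes = Counter()
    for key in input_keys:
        # set() so that one input key counts at most once per person
        for person_id in set(database.get(key, [])):
            votes[person_id] += 1
    return sorted(p for p, n in votes.items() if n >= min_matches)
```

Each key is looked up independently of the others, which is why the per-key accesses can be spread over the nodes of a parallel system and only the counts need to be combined.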

The characteristics of such an indexed file system for 
identification and/or verification of individuals using bio- 
metric data are that the size of the database may change by 
many orders of magnitude over time, and the key space is 
typically very unevenly used (i.e. there are long stretches of
keys with no data while some keys have lots of data). The 
present invention is well suited to databases having these 
characteristics. 

The method of the preferred embodiment is highly scal- 
able and can support growth of the database by many orders 
of magnitude, in fact by a factor of 2 raised to the power of
the number of bits in the key. At its smallest the database can 
be one hash table, with all of the data for all of the keys 
within it. At its largest, the database can include a separate 
disk file for every key, but will typically include a separate 
hash table for one or very few keys in areas of the key space 
which include a lot of data as well as hash tables represent- 
ing larger numbers of keys in areas of the key space which 
have little data. 

A method of data access according to the preferred 
embodiment of the invention is highly suitable for imple- 
mentation in a parallel processing system, since once the 
data has been allocated to one node in a parallel system it can 
stay there. A radix tree directory can be analysed to provide 
a clear decision as to which node in a parallel system the data 
for a given key belongs, by means of pointers at the leaf 
nodes which identify both a particular system node and a 
particular disk file stored on that node. 
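
A minimal Python sketch of such a directory walk is given below. The class names `LeafPointer` and `BranchNode`, the function `route`, the one-bit-per-level branching and the assignment of the four example files to two system nodes are hypothetical choices made for illustration only; in the sketch the type of the node plays the role of the flag that marks a leaf.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class LeafPointer:
    system_node: int  # which node of the parallel system holds the data
    disk_file: str    # which disk file on that node


@dataclass
class BranchNode:
    zero: Union["BranchNode", LeafPointer]
    one: Union["BranchNode", LeafPointer]


def route(directory, key, key_bits):
    """Walk the directory from the most significant key bit until a
    leaf is reached, then return (system_node, disk_file)."""
    node = directory
    depth = 0
    while isinstance(node, BranchNode):
        bit = (key >> (key_bits - 1 - depth)) & 1
        node = node.one if bit else node.zero
        depth += 1
    return node.system_node, node.disk_file


# Hypothetical directory: four files spread over two system nodes.
EXAMPLE_DIRECTORY = BranchNode(
    zero=BranchNode(LeafPointer(0, "dbfile.00"), LeafPointer(1, "dbfile.01")),
    one=BranchNode(LeafPointer(0, "dbfile.10"), LeafPointer(1, "dbfile.11")),
)
```

Because the walk ends in a single (system node, disk file) pair, the decision about where the data for a key belongs is unambiguous and does not change unless the corresponding leaf is split.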

The method of the preferred embodiment provides fast 
access to keyed data for standard operations such as insert, 
delete and retrieve. Data for any given key can usually be 
retrieved with a single disk operation, and so the time taken 
to retrieve the data remains almost constant over a very wide 
range of sizes of the database. 

BRIEF DESCRIPTION OF DRAWINGS 

An implementation of the invention will now be described 
in more detail, by way of example, with reference to the 
accompanying drawings in which: 

FIG. 1 is a schematic representation of a biometric data 
capture and indexed file system in which the invention is 
implemented; 

FIG. 2 is a representation of an example data organisation 
map according to an embodiment of the invention, showing 
both conceptual (FIG. 2A) and physical (FIG. 2B) repre- 
sentations of the directory and an example hash table; 

FIG. 3 is a schematic flow diagram showing the steps of 
a method of allocating and accessing data records according
to an embodiment of the invention; and 

FIGS. 4 and 5 are schematic flow diagrams showing the 
steps of particular aspects of a method of reorganising a 
database according to an embodiment of the invention. 

DETAILED DESCRIPTION OF PREFERRED 
EMBODIMENT 

FIG. 1 shows a file system 10 according to the invention
receiving data from a fingerprint scanner 20. The scanner 20
captures an image of a fingerprint and passes this to a
connected computer 30 which analyses the image to identify
"minutiae" (fingerprint feature points comprising
singularities in ridge patterns such as bifurcations or "forks", and
ridge endings). These fingerprint feature points are then
grouped into local subsets and data is generated which
characterises the fingerprint image in the area of the group
of feature points. As an example, five thousand feature point
subsets may be identified and used to generate the
characteristic data describing a single fingerprint. Each of the
feature point subsets can be represented by an N bit number
(for example, a 30 bit number).

Generation of characteristic data from a fingerprint image
is described in detail in a co-pending U.S. patent application
Ser. No. 08/764,949, which is assigned to the assignee of the
present application and is incorporated herein by reference.
However, the specific content of the data and the detail of its
generation are not essential to the present invention and the
detailed description of fingerprint image storage and
matching is included herein only as an example application of the
invention.

Each N bit number representing a feature point subset is
then stored in the file system 10 together with a 4 byte
identifier number of the person from which the fingerprint
image was taken. This identifier number is then used to
access additional personal information for each person
registered with the system. This additional personal information
is stored in a separate database and accessed by identifier
to avoid having to replicate the full personal information for
each stored N bit number. Typically, information stored in
the additional database relating to an individual person will
include personal data including name, address, and national
insurance number, social security number or identity card
information, but the information may include any other data
for which it is useful to be able to identify a person, or verify
their identity, such as health records, bank details, criminal
records, etc.

The file system 10 may be directly connected to or located
remotely from the scanner 20 and analysing computer 30 in
a computer network. The file system 10 according to a
preferred embodiment is a parallel processing system having
controller logic 50 implemented in software installed on
each of a number of nodes 60 of the system. Each system
node 60 is a processing unit comprising a processor 70 and
random access memory 80 and is connected for
communication with a respective secondary storage component 90
which may be a peripheral storage device. A database of
fingerprint characterising data is distributed across the
storage components 90 of the nodes 60, with each node
providing access to its own storage component's database
[illegible].

The mechanisms implementing allocation of and access to
data within the file system 10 will now be described in detail.
The stored data is arranged in hash tables 100 to permit
efficient data access. Hash tables are known to be well suited
to databases to which data items are added in an
unpredictable manner. Each N bit number provides an input key
which is mapped by a hashing algorithm (under the control
of the controller software 50 running on the processor 70) to
an integer hash value which is then used to determine where
the information is stored in the hash table. The bit pattern of
each N bit number is thus used to provide a key (k_i) which
is transformed to a value (X_i) within the scale of a hash table
index 110 (i.e. a value which will match an entry in the
index) and this transformed key (X_i) is then compared with
the index 110 to determine a storage location 120 within the
secondary storage 90 of the file system 10 for the
information associated with that N bit number.

N bit numbers beginning with the same initial sequence of
bits are not necessarily located within the same node of the
parallel file system, since that would constrain the
distribution of data across the nodes and may prevent optimal data
sharing, but each hash table 100 is located in the secondary
storage (for example, disk storage) 90 of a single system
node 60 to enable efficient data access. If sufficient memory
is available, then the indexes 110 of the hash tables
[illegible] peripheral disk storage 90 and
only transferred into memory 80 when required for data
[illegible] a particular node, if the N
[illegible]
stored in the same disk file.

The logical partitioning of the data within files according
to the N bit number which represents part of that data
provides a simple indexing scheme and an efficient access
mechanism.

In addition to hash tables, the mechanism for allocating
data within the file system and for providing access to the
data also employs a dynamically extendible radix tree
directory 130 for identifying particular ones of the hash tables.
The radix tree directory 130 is referenced when processing
an input key (k_i) to perform an iterative analysis (using a
comparator to make a determination for each "node" or level
[illegible]
distinguish between keys and identify a relevant one of a
plurality of hash tables. Directory nodes are thus
conceptual decision points at which a branch is selected. "Leaf
nodes" comprising the lowest level nodes of respective
branches of the radix tree directory provide pointers (P) to
respective hash tables 100,110 on identified nodes. Leaf
nodes also include a flag which identifies them as leaf nodes
which is used to trigger retrieval of the pointer and node
identifier (preventing invalid attempts to step through
non-existent [illegible] directory).

The radix tree directory and hash table indexes together
form a map representing the complete organisation of data
for the file system. In the case of a parallel system, a first
controller node of the system is responsible for distributing
data across the system nodes 60 and determining which node
is responsible for storing data relevant to a particular key,
and so the radix tree directory is held in the memory of this
first node. The pointers within the leaf nodes of the directory
identify a respective node and disk file for each input key.
The other nodes 60 of the system only hold in memory the
indexes for files located in the disk storage of those nodes
([illegible] these indexes may be paged out to disk if memory
space is limited, as noted earlier). Thus, in a parallel system,
the data organisation map for the file system is not
[illegible] but is distributed across the
system nodes.

[illegible] processed by the processor 70 running [illegible]
software 50, and distributed across the memory 80 of the
nodes 60 of the file system) to achieve allocation of input
data and access to data in response to a search key, as will
be described.

As a simple example, let us assume that the database
[illegible] comprises a small number of data files on a single
[illegible]
dbfile.00
dbfile.01
dbfile.10 and
dbfile.11
Each of these files is held in disk storage and contains an
index 110 identifying a number of data bins 120 which are 
variable size contiguous storage blocks within the file. (N.B. 
a more typical example than this simple example of 4 data 
files would have 400 separate data files). In a preferred
implementation, each data file contains 16,001 bins which 
each have a small amount of storage space initially allocated 
to them. All the bins are initially the same size, but if there 
is already data allocated to bins in the file or we have some 
information about the distribution of bin sizes then an effort
is made to allocate storage for a bin that is appropriate for 
the expected final size of the bin. The allocated storage space 
can be increased as data is added, and the final size of each 
bin is only constrained by the maximum size of the disk files. 
This is set as 500 MB in the preferred embodiment of the
invention, although it could be a higher limit closer to the
operating-system-imposed maximum of, for example, 2
gigabytes.

The first file, dbfile.00, contains all the data for all keys
that begin with binary '00', and the other filenames are
similarly representative of the data which is allocated to 
them. 

An example directory which is required to select between 
these four files when storing or accessing data initially has 
the simple structure of a "root" node (at the top level of the
hierarchy) and four "leaf" nodes (at the second and lowest
level of the hierarchy), the leaf nodes each containing a 
pointer to a corresponding one of the files. When the system 
is initialised, it builds a map representing this data organisation
for the current hash table and directory structure.

If the file system 10 is implemented in a parallel process- 
ing system with the files distributed over the nodes of the 
system, the generated map also identifies which node the file 
is on. In particular, the bit pattern of the N bit numbers is 
used to identify a leaf node of the radix tree directory which
contains data indicating which file they should each be 
allocated to and which node that file is held on. 

For simplicity, we will describe the allocation of data in 
the simple example database mentioned above which initially
comprises four data files on a single node system with
reference to FIG. 3. As data is inserted into the database, 
such as when a person is registered with the system and 
his/her fingerprint is first scanned 200, the first two bits of 
each of the generated N bit numbers (step 210) representing 
a feature point subset for that person's fingerprint are
analysed (230) by comparator logic (a component of the 
controller software 50 held in the first controller node 60) 
run by the processor with reference to the directory to select 
a path from the root node to identify one of the leaf nodes. 
The selected leaf node provides a pointer to a hash table
corresponding to one of the four data files (and a node 
identifier in the case of a parallel system). 
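The directory walk described above can be sketched as follows. This is a minimal illustration only: the class names `Leaf` and `Branch`, the function `lookup` and the 8-bit key width are assumptions made for the example and do not come from the specification.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union

BITS_PER_LEVEL = 2  # the four-file example analyses two bits at the root


@dataclass
class Leaf:
    file_name: str                 # e.g. "dbfile.00"
    node_id: Optional[int] = None  # only needed in a parallel system


@dataclass
class Branch:
    children: Dict[str, Union["Branch", Leaf]] = field(default_factory=dict)


def lookup(root: Branch, key: int, key_bits: int) -> Leaf:
    """Follow the key's leading bits from the root until a leaf is reached."""
    bits = format(key, f"0{key_bits}b")
    node: Union[Branch, Leaf] = root
    pos = 0
    while isinstance(node, Branch):
        node = node.children[bits[pos:pos + BITS_PER_LEVEL]]
        pos += BITS_PER_LEVEL
    return node


# Initial directory: one root and four leaves, one per data file.
root = Branch({p: Leaf(f"dbfile.{p}") for p in ("00", "01", "10", "11")})
print(lookup(root, 0b00101101, key_bits=8).file_name)  # dbfile.00
```

Because a child may itself be a `Branch`, the same loop also handles a directory that has been deepened along only some of its paths.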

Having identified a data file for storing each N bit number 
by retrieving 230 the pointer from the directory, the index 
and associated logic for that hash table is accessed 240 in
memory (or retrieved into memory if currently stored on 
disk). The N bit number is then transformed 250 into a hash 
value by a hash function comprising part of the associated 
logic of the data file. This hash value is then compared 260 
with the index of the hash table to identify a particular data
bin (including obtaining its location, capacity and the num- 
ber of records currently stored within it) within the data file 
in which to store the N bit number. 
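As a rough sketch of this index look-up, the fragment below models an index whose entries carry a bin's location, capacity and current record count. The modulo hash, the record size and the names `IndexEntry`, `build_index` and `locate_bin` are assumptions for illustration; the specification does not state which hash function is used.

```python
from dataclasses import dataclass
from typing import List

NUM_BINS = 16_001  # bins per data file in the preferred implementation


@dataclass
class IndexEntry:
    offset: int    # location of the bin within the data file
    capacity: int  # number of records the bin can currently hold
    count: int     # number of records currently stored in the bin


def hash_key(key: int) -> int:
    # Assumption: simple modulo hashing stands in for the unspecified function.
    return key % NUM_BINS


def build_index(initial_capacity: int, record_size: int) -> List[IndexEntry]:
    """All bins start the same size and are laid out contiguously."""
    bin_bytes = initial_capacity * record_size
    return [IndexEntry(offset=i * bin_bytes, capacity=initial_capacity, count=0)
            for i in range(NUM_BINS)]


def locate_bin(index: List[IndexEntry], key: int) -> IndexEntry:
    """Transform the key and use the result to select an index entry."""
    return index[hash_key(key)]


index = build_index(initial_capacity=4, record_size=16)
entry = locate_bin(index, 16_001 + 5)
print(entry.offset, entry.capacity, entry.count)
```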

If the first two bits of an N bit number are '00', then the
comparison step performed at the root node of the directory
identifies the directory branch to the leaf node containing a
pointer to the hash table for dbfile.00. The N bit number is
then transformed by a hashing algorithm and the result is
compared 260 with the hash table index for dbfile.00 to
identify a bin within that file. The N bit number and its 
associated data are then sent to and stored 300 in that bin 
within disk storage. There is no need to specify a particular 
different storage location within the bin for each data record, 
since an entered data record is stored in the first available 
storage location in the bin. 

In the preferred embodiment of the invention, each data 
entry operation is combined with a table look-up operation 
(270, 280, 290) which determines whether data is already
held in the file system for that individual. The data entry 
policy could be to only store data if the database does not 
already include data for that individual, or to always accept 
data entry but to flag 290 matches with existing stored data 
for subsequent action by a system administrator or user of 
the system. Policies such as what reliance is to be placed on 
the automated record matching of the invention and whether 
to have a further step of an expert human fingerprint analyser 
checking apparent matches are implementation options for 
system users, and do not limit the scope of the invention. 

Of course, in alternative embodiments of the invention, 
the above described four-way branching of the radix tree 
directory following a determination by a comparator at a 
first node which looks at blocks of two bits could be replaced 
with a different tree structure. For example, a binary differ- 
entiation at each level of the directory can be performed if 
a single bit is analysed, but the process of stepping through 
the levels of a binary tree tends to be slower than for a tree 
which splits more than two ways at each level. The directory 
could equally branch at each node into eight or sixteen or
any number 2^n of branches (where 'n' is an integer) if blocks of 'n' bits
are analysed. The choice of 'n' must balance the benefit of
reducing the number of separate comparison steps by making
'n' large against the cost of unused allocated storage
space for large 'n'.
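The trade-off can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that 16 bits of each key are consumed by the directory; the function name `directory_shape` is likewise hypothetical.

```python
import math
from typing import Tuple


def directory_shape(directory_bits: int, n: int) -> Tuple[int, int]:
    """Return (fan-out per branch node, comparison steps to reach a leaf)
    when the directory analyses n bits of the key at each level."""
    fan_out = 2 ** n                          # pointer slots per branch node
    levels = math.ceil(directory_bits / n)    # comparison steps per look-up
    return fan_out, levels


for n in (1, 2, 4):
    fan_out, levels = directory_shape(16, n)
    print(f"n={n}: {fan_out}-way branching, {levels} comparison steps")
```

A larger 'n' shortens the path to a leaf, but every branch node then reserves 2^n pointer slots whether or not all of them are used.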

Comparison of 4 bits at each level of the tree hierarchy 
and a consequent 16-way branching split at each level has 
been found to provide optimal use of processing hardware in 
a typical application of the invention. As the database 
increases in size, and feature point subset descriptors (N bit 
numbers) having the same initial bits and hence being 
transformed into the same hash value are added to the same 
bin, all of the storage space available in certain bins will 
eventually be filled. At this point it becomes necessary to 
extend the storage file by replacing a full bin with a larger 
bin, and changing the hash index to reflect this change. This 
is implemented by a process within the controller software 
50, as will now be described with reference to FIG. 4. When 
an attempt 310 to insert data into a bin identified by the hash 
table index generates a response indicating that the bin is full 
(320, 330), the controller software 50 allocates 350 a new
area of storage to this hash table which is twice the size of 
the storage space currently occupied by the bin, thereby 
extending the total size of the hash table. The controller 
software updates 350 the hash table index to point to the 
address of the newly allocated storage space and the data 
from the full bin is transferred 350 to the larger bin at the 
new address. A full bin is easily identified since the hash 
table indices contain the current capacity of each bin and the 
current number of records in each bin as well as the bin 
address — if a comparison 320,330 between the current num- 
ber of records and the capacity finds that they match, then 
the bin needs to be expanded 350 before more data can be 
entered 340. The data entry process thus includes the step of 
checking 320 this data within a hash table index. 
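The bin expansion step might look like the following in-memory sketch. Storage is modelled as a count of record slots rather than real disk blocks, and the names `Bin` and `DataFile` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bin:
    address: int    # start of the bin's storage area (in record slots)
    capacity: int   # record slots currently allocated to the bin
    records: List[bytes] = field(default_factory=list)


class DataFile:
    def __init__(self, num_bins: int, initial_capacity: int) -> None:
        self.index = [Bin(address=i * initial_capacity, capacity=initial_capacity)
                      for i in range(num_bins)]
        self.next_free = num_bins * initial_capacity  # end of allocated storage

    def insert(self, bin_no: int, record: bytes) -> None:
        b = self.index[bin_no]
        if len(b.records) == b.capacity:      # record count matches capacity: full
            new_capacity = 2 * b.capacity     # new area is twice the current size
            b.address = self.next_free        # index now points at the new area
            self.next_free += new_capacity
            b.capacity = new_capacity         # old area is left for later reuse
        b.records.append(record)              # stored in first available location


f = DataFile(num_bins=4, initial_capacity=2)
for i in range(3):
    f.insert(1, f"record-{i}".encode())
print(f.index[1].address, f.index[1].capacity, len(f.index[1].records))
```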

The storage space which is freed up by moving the bin is
subsequently reallocated when required. Extending the data 
files by this mechanism can be repeated as required until the data file reaches its preset maximum size. In the preferred embodiment, the data files have a set maximum storage capacity of 500 MB.

When a file such as dbfile.00 has reached its maximum size, any further insertion of data requires splitting of the indexed data file, as will now be described with reference to FIG. 5. This is handled by a utility within the controller software 50 which is set up by a system administrator to run periodically 400 as a background data reorganisation process while the database is off-line (i.e. unavailable for data access). The complete database is scanned 410 by this process to check all data file sizes. The period between scans may be set by the system administrator to be one day or one week, for example, and the scan may be set to run daily at midnight (when no data access operations are being performed), or at whatever time and frequency is considered appropriate for the desired size of the data files and the administrator's expectations as to how quickly the database is likely to expand. The administrator may trigger this reorganisation process manually if desired, but an automated time-based trigger for the scanning of file sizes is preferred.

When file dbfile.00 is found by this utility to have expanded beyond the defined maximum (at step 420), it is split into a number of files. According to the preferred embodiment, one file is replaced 440 with two files (which minimizes the increase in allocated storage space).

For example, if dbfile.00 has grown beyond the normal operating range, it may be split into dbfile.001 and dbfile.000, both of which are smaller than the normal operating size limit.

This requires a corresponding change 440 to the radix tree directory, and this is also handled by the background reorganisation utility of the controller software 50. When the system restarts the radix tree directory will then indicate this different key splitting. There will now need to be two file indices built and held in memory where previously there had only been one, so this places a small extra demand on the system, but since the amount of data is growing, this is to be expected. Access to the data in the two new files will be slightly quicker, since the bins within both files will have become smaller, and less collision detection or "disambiguation" will be needed.

As another example, oversize files may be replaced by four files each having the same storage capacity of 500 MB and called:

dbfile.0000
dbfile.0001
dbfile.0010 and
dbfile.0011

In this example, the leaf node of the directory containing the pointer to dbfile.00 is replaced (under the control of the background reorganisation utility of the controller software) by a branching node at which the third and fourth bits of the N bit numbers beginning '00' are analysed by a comparator to identify one of four leaf nodes each containing a pointer to one of the new files dbfile.0000, dbfile.0001, dbfile.0010 and dbfile.0011. Note that there is no requirement at this stage for the files dbfile.01, dbfile.10 or dbfile.11 to be replaced and there is no requirement for the directory to have a uniform number of hierarchical levels for all branches. There is also no requirement for all directory pointers to point to their own separate disk file.

The replacement of disk files and modification of the directory representing the allocation of data to files is thus implemented as a local modification which only adds to the depth of the directory at the required position. This significantly reduces wastage of storage space compared with methods which require a uniform directory structure with the same number of levels for each branch. With databases holding abstract data such as biometric data, there is typically a very uneven distribution of data across the key space and so local extensibility can save a great deal of storage space when compared with extending the database and keeping a uniform directory depth. Local modification similarly makes shrinking of the directory in response to a reduction of the data in the database much simpler than if entire layers of a uniform directory must be removed for shrinking to occur, as the amount of rehashing is greatly reduced.

After a period of time in which data has been added to the database, certain of the four data files in our example will have been split, and under-used data files may have been joined. The database may now include, for example, the following files:

dbfile.0000
dbfile.0001
dbfile.0010
dbfile.0011
dbfile.0100
dbfile.010100
dbfile.010101
dbfile.010110
dbfile.010111
dbfile.0110
dbfile.0111
dbfile.10
dbfile.11

The numerals within each file name suffix are representative of the data allocated to that file, and represent the sequence of bits of a key used [...] themselves to represent the directory structure. It also allows the depth of the directory to be determined directly from the file names. FIG. 2 shows a data organisation map representing this file structure. In this example database, there is little data for keys beginning '10' and '11', so they may require only one file each for their data. There is more data for keys beginning '01', perhaps requiring a file just for keys beginning '010111'. Note that the splitting of files into separate files according to [...] directory and that the directory always forms a complete binary tree.

Addition of data can thus lead to more files being required to hold the data for certain areas of the key space and, as files are split to create new files, the range of N bit numbers and associated data each new file contains will be narrower. If an area of the key space does not have much associated data then the data can be held in fewer files and fewer levels of the hierarchical directory are required for that area of the key space. The organisation of hash tables and the directory pointing to the hash tables is optimised for the current state of the database. This entails providing the rapid searching of hashing to access data where an area of the key space is relatively sparsely populated with data, but replacing hash tables with indexed data files which avoid the collision detection of conventional hashing techniques by using the directory to point directly to the data associated with an access key where that key has a lot of data associated with it (since the bin contains only data for that key). This will now be described in more detail.

As the amount of data associated with certain keys increases, there will be a relatively small number of keys
mapped to an individual disk file. When this number is less than a predefined threshold which represents the maximum number of entries deemed appropriate for a disk file index (for example, 16384 entries; see below), hashing is no longer necessary and it is desirable to reorganize the disk file such that each bin holds data for only a single key. The disk file index for identifying data bins will then provide a direct access to the data relating to an input key, and no collision detection will be required. This saving of the cost of collision detection as the system grows is a significant advantage of the present invention.

In the preferred embodiment, this reorganisation is part of the background data reorganisation process implemented by the off-line utility of the controller software 50. When a disk file has been identified as larger than the preset maximum size (e.g. 500 MB) and so is to be split, the reorganisation process identifies how many bits of the N bit number are to be used by the directory to identify the relevant disk file and how many bits are available for use in the disk file to distinguish between keys. If an input key is a 30 bit number and 16 bits are used by the directory, then 14 bits remain for use to distinguish between keys in the disk file and so the number of different keys which map to that disk file is 2^14 (16384).

Thus, when the directory depth is determined by the reorganisation utility to be 16 levels deep at some point (as demonstrated by certain disk file names including 16 bit suffixes representing the data they hold), the new disk files which are created to replace the over-sized disk file will not be hash tables requiring collision detection to separate relevant from irrelevant data within a bin, but will have indices mapping each input key to its own separate data bin.

If sufficient data records are deleted from certain disk files, the data may be reallocated to a smaller number of disk files, with consequential shrinking of the directory. The disk files are preferably all maintained at the same order of magnitude in size, preferably between 100-500 MB.

If the system administrator wants to rebalance the data over a larger or smaller number of nodes or disks, they can thus do so while the system is off-line and the system will rebuild the map of files and the radix tree directory with appropriate modifications when it restarts.

Having described allocation of data to storage locations, the accessing of stored data within the file system will now be described in detail. When a person's fingerprint is scanned, the file system 10 including the database of fingerprint characteristics can now be used to verify that person's identity. The data organisation described earlier enables efficient access to the database of fingerprint-characterising data records.

As noted earlier, each data entry operation is preferably coupled with a table lookup operation to determine whether data is already held in the file system for that individual. For certain databases, it will be essential for data which is characteristic of an individual to be added to the database only after a comparison has been performed with data already stored in the database, to avoid duplicate records appearing for the same person. An example of this is a social services database where it is important to prevent individuals collecting welfare payments under multiple aliases, or an identity card database where no individual should have multiple identities. Thus, there is often a dual requirement for the ability to verify the identity of an individual by checking their records in the database and the ability to determine whether this individual already has an entry in the database before new data is added. In some embodiments of the invention, the comparison steps which will now be described are therefore part of the process of adding data items to the database.

A scanner 20 scans a fingerprint and captures a fingerprint image which is then processed by a connected computer 30 to identify minutiae. The minutiae are processed to generate a large number of N bit numbers (for example, 5000) which characterise the image by reference to local groups of minutiae. This generation of data comprises the same sequence of steps described above in relation to adding new entries to the database. However, the required operation now is to compare the generated data characterising the current scanned image with stored fingerprint-characterising data for the population of people whose fingerprints are stored. This may be a verification or identification operation subsequent to entering a person's data into the database, or it may be a check prior to adding data to the database.

Each of the 5000 generated N bit numbers is a key which is used to identify a relevant data storage bin within an identified indexed data file, as described above for a data entry operation. Having identified the storage bin for one of the keys, all data records within that bin are copied into system memory. The controller software then determines which of the copied records relates to the key. This is achieved by simple comparison between the hashed key and the record data (since the hashed key is part of the stored data for each record). The determination of which data records in a bin are relevant to a given key is necessary for any bin holding data for a plurality of different keys. Bins usually contain data for more than one key. The identification of a bin and "disambiguation" or selection of relevant records is performed for each N bit number independently of the other N bit numbers generated from the same fingerprint image.

Each data entry operation corresponding to registration of an individual and also each verification and search operation for an individual preferably includes the above described steps for data allocation or data access. The insertion and deletion of data items are handled in the same way as comparisons, except that the bin is altered and then written back for insertions and deletions.

For concurrent processing of such a large number of N bit numbers, the implementation in a parallel processing system is advantageous. The directory is used to identify, by means of a pointer within a leaf node, the node of the system on which records having the same initial key bit pattern as the current data item are stored. That determination provides a clear identification of the relevant node of the parallel system. The bit pattern of the current data item is iteratively analysed to select a path through the directory until a leaf node is reached, and then the pointer within the leaf node points to one of the nodes and one of the disk files.

Having transformed an input key and compared the transformed key with the index of an identified hash table to identify a bin, operating system facilities are used to read data records in the bin and the data records within the bin are analysed to obtain just the data for the current key. In the present embodiment, this simply involves comparing the key with each data record in the bin. Each key search then outputs a number of data records comprising the records associated with the key. Keys are not unique to an individual person and so each data record output as a result of the search will typically identify a different person. Each of these identifications may be thought of as a "vote" or suggestion as to which person registered with the system the scanned fingerprint belongs to. These identifications of individuals resulting from each of the 5000 key search operations are then compared and individuals who only have
one or two "votes" for them are discounted. If one individual
has hundreds of positive votes, or even 25 or 50 positive
votes then there is a high likelihood that this is a fingerprint
match since there are many common characteristics between
the scanned and stored fingerprints. The controller software
50 according to the preferred embodiment is thus able to
analyse the 5000 search outputs and make a determination as to
whether there is a reliable match identifying a particular
person.
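The vote comparison can be illustrated with a short sketch. The person identifiers, the function names `tally_votes` and `best_match`, and the default threshold of 25 votes are assumptions chosen for the example; the specification leaves the match policy to the implementer.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def tally_votes(search_outputs: Iterable[Iterable[str]]) -> Counter:
    """Count one vote for every record returned by every key search."""
    votes: Counter = Counter()
    for person_ids in search_outputs:
        votes.update(person_ids)
    return votes


def best_match(votes: Counter, min_votes: int = 25) -> Optional[Tuple[str, int]]:
    """Return the highest-scoring person, or None if nobody reaches
    the (assumed) minimum number of votes."""
    for person, count in votes.most_common(1):
        if count >= min_votes:
            return person, count
    return None


# 30 key searches return a record for "alice"; a few stray hits for others.
outputs = [["alice"]] * 30 + [["bob"]] * 2 + [["carol"]]
print(best_match(tally_votes(outputs)))
```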

The steps implemented for retrieving the stored data for
a key may be summarised as follows: 

1. If the system is running on a parallel processor use the 
radix tree directory to decide which node the data for 
the key is on. 

2. Use the radix tree directory to decide which of a 
number of indexed data files on that node contains the 
data for the key. 

3. Use the index for the indexed data file to locate the bin 
that contains the data for the key. The indices for the 
files are read at start-up time and held in memory, 
although they may be paged out if there is not room to 
hold them all. 

4. Use the operating system to read in the bin and retrieve 
the contents of the bin into memory. 

5. "Disambiguate" the data within the bin — i.e. select just 
the data records for the given key from the records 
within the bin. 
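The five steps above can be sketched end to end with a toy in-memory model. Everything here is illustrative: the fixed two-bit directory, the seven-bin files, the modulo hash, the `(key, person)` record layout and the function names are assumptions, and real bins would be read from disk through the operating system.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, str]   # (key, person identifier); hypothetical layout
NUM_BINS = 7               # tiny for illustration; the text uses 16,001 bins

# Steps 1 and 2: leading key bits select a (node id, file name) pair.
directory: Dict[str, Tuple[int, str]] = {
    "00": (0, "dbfile.00"), "01": (0, "dbfile.01"),
    "10": (1, "dbfile.10"), "11": (1, "dbfile.11"),
}

# Each file is modelled as a list of bins, each bin a list of records.
storage: Dict[Tuple[int, str], List[List[Record]]] = {
    loc: [[] for _ in range(NUM_BINS)] for loc in directory.values()
}


def find_file(key: int, key_bits: int) -> Tuple[int, str]:
    return directory[format(key, f"0{key_bits}b")[:2]]


def store(key: int, person: str, key_bits: int = 8) -> None:
    storage[find_file(key, key_bits)][key % NUM_BINS].append((key, person))


def retrieve(key: int, key_bits: int = 8) -> List[str]:
    location = find_file(key, key_bits)        # steps 1 and 2: node and file
    bin_no = key % NUM_BINS                    # step 3: index look-up (assumed hash)
    records = list(storage[location][bin_no])  # step 4: read the bin into memory
    return [p for k, p in records if k == key] # step 5: disambiguate


store(0b00000001, "alice")
store(0b00001000, "bob")     # different key, same file and same bin
print(retrieve(0b00000001))  # only the records for the requested key
```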

Then the retrieved data for a large number of different 
keys generated from one fingerprint is compared to deter- 
mine whether a particular individual is identified a large 
number of times. A person skilled in fingerprint analysis may 
then check whether a match identified by the controller 
software 50 is valid. The advantages of automated scanning 
of several million fingerprints and identification of probable 
matches are clear even if such manual checking of the end 
result is considered desirable. 

In an alternative embodiment of the invention, a repre- 
sentative sample of the data to be stored is used to estimate 
the distribution of keys and this information is combined 
with the expected final size of the database to determine an 
appropriate database organisation for the expected database 
size and distribution. The database is then separated into the 
appropriate number of disk files, and the directory is built to 
enable each disk file to be accessed via a pointer held at a 
corresponding leaf node of the directory. It is also possible 
to allocate space to each of the bins according to their 
predicted size. This requires a considerable amount of disk 
space to be available at the beginning as well as accurate 
data distribution information, as an alternative to the data- 
base map and associated resources being minimized and 
only expanded when required. 
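One way such a sample-based organisation might be planned is sketched below. The function `plan_files`, the one-bit-at-a-time splitting and the example sizes are assumptions for illustration and are not taken from the specification.

```python
from typing import Dict, List


def plan_files(sample_keys: List[int], key_bits: int,
               expected_total_bytes: float, max_file_bytes: float) -> Dict[str, float]:
    """Map key prefixes to expected file sizes, splitting a prefix further
    only where the sample suggests its file would exceed the maximum."""
    if not sample_keys:
        raise ValueError("a non-empty sample is required")
    bit_strings = [format(k, f"0{key_bits}b") for k in sample_keys]
    plan: Dict[str, float] = {}

    def split(prefix: str, members: List[str]) -> None:
        # Scale the sample share of this prefix up to the expected database size.
        expected = expected_total_bytes * len(members) / len(bit_strings)
        if expected <= max_file_bytes or len(prefix) == key_bits:
            plan[prefix] = expected
            return
        for bit in "01":
            child = prefix + bit
            split(child, [b for b in members if b.startswith(child)])

    split("", bit_strings)
    return plan


# Sample skewed towards keys beginning '0'; sizes are arbitrary units.
sample = [0b0000, 0b0001, 0b0010, 0b0011, 0b0100, 0b0101, 0b1000, 0b1100]
print(plan_files(sample, key_bits=4, expected_total_bytes=800, max_file_bytes=300))
```

In this sketch the densely populated part of the key space receives longer prefixes, and hence more files and more directory levels, than the sparse part.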

What is claimed is:

1. A data access mechanism for a computer filing system, 
the access mechanism including:

an extensible hierarchical directory having a plurality of 
directory entries at least some of which provide point- 
ers to respective indexed data files; 

logic for analysing a data key input to said filing system, 
by comparison of the data key with entries in the
extensible hierarchical directory, to identify one of said 
plurality of directory entries providing a pointer to a 
respective indexed data file of the respective indexed 
data files; 

a plurality of indexed data files for storing data; and
a plurality of indices, each corresponding to one of said 
indexed data files; 
wherein each of said indexed data files is locatable via one
of said pointers, wherein the index for each of said 
indexed data files contains identifiers and storage 
addresses of data blocks therein, each of said indexed 
data files having associated logic for transforming an 
input data key to generate a transformed key and logic 
for comparing a transformed key with data block 
identifiers in the index to identify a match and to obtain 
the address of an identified data block within the 
indexed data file, and each of said indexed data files 
having associated logic for identifying from records 
within the identified data block one or more data 
records related to the input data key. 

2. A data access mechanism according to claim 1, 
wherein:

the directory is an hierarchical directory comprising a 
plurality of branch nodes pointing to other nodes to 
define paths through the directory from a root node to 
a plurality of leaf nodes each forming the lowest level 
of the hierarchy for a respective path, each leaf node 
providing a pointer to an indexed data file; and 

the logic for analysing an input data key by comparison 
with the directory is adapted to iteratively analyse the 
input data key by comparison of the key with directory 
entries at each successive node of a path through the 
directory to select a branch to a next node leading to a 
respective leaf node, each iteration of the analysis at 
each successive node using one or more additional bits 
of information from the input key to distinguish 
between keys and select a branch. 

3. A data access mechanism according to claim 1, 
wherein: 

the indexed data files each have a predefined maximum 
size and are extensible up to said predefined maximum; 
and 

the data access mechanism includes logic for identifying 
indexed data files exceeding the predefined maximum 
size and for responding to a first indexed data file 
exceeding the predefined maximum size by replacing 
the first indexed data file with two or more separate 
indexed data files and by extending the hierarchical 
directory by addition of new directory nodes branching 
from the node of the directory which previously iden- 
tified the pointer to the first indexed data file, the new 
nodes forming a new lowest level of the directory 
hierarchy for the respective branch, such that the 
extended directory is adapted to use one or more 
additional bits of information from input data keys to 
distinguish between the keys and select a branch to 
identify a respective pointer to one of said two or more 
replacement indexed data files.

4. A data access mechanism according to claim 3, wherein 
said logic for replacing the first indexed data file is adapted 
to determine when an array of keys mapped to a replacement 
indexed data file includes no more than a threshold number 
of keys and, in response to said determination, to build the 
index for said replacement indexed data file such that each 
key within said array of keys corresponds to a separate data 
block within said file. 

5. An indexed file system including a data processing 
system having a processor, memory, peripheral storage, and 
communication buses for transferring data between said 
processor, memory and peripheral storage, said indexed file 
system also including a data allocation and access mecha- 
nism comprising: 

an extensible hierarchical directory including a plurality 
of directory entries at least some of which provide 
pointers to respective indexed data files; 
logic for use by said processor to analyse a data key input
to said filing system, by comparison of the data key 
with entries in the extensible hierarchical directory, to 
identify one of said plurality of directory entries providing
a pointer to a respective indexed data file of the
respective indexed data files; 

a plurality of indexed data files for storing data in said 
peripheral storage; and 

a plurality of indices, each corresponding to one of said 
indexed data files; 

wherein each of said indexed data files is locatable via one 
of said pointers, wherein the index for each of said 
indexed data files contains identifiers and storage
addresses of data blocks within the indexed data file, 
each of said indexed data files having associated logic 
for use by said processor to transform an input data key 
to generate a transformed key and logic for use by said 
processor to compare the transformed key with data 
block identifiers in the index to identify a match and to 
obtain the address of an identified data block within the 
indexed data file, and each of said indexed data files 
having associated logic for use by said processor to 
identify from records within the identified data block
one or more data records related to the input data key. 

6. A filing system for a computer including: 

a directory having a plurality of directory entries at least 
some of which contain pointers to respective indexed 
data files; 

logic for analysing a data key input to said filing system, 
by comparison of the data key with entries in the 
directory, to identify one of the plurality of directory 
entries providing a pointer to a respective indexed data 
file for storing data relating to the input key; 

a plurality of indexed data files for storing data, each 
locatable via one of said pointers; and 

a plurality of indices, each corresponding to one of said 
indexed data files; 

wherein the index for each indexed data file contains 
identifiers and storage addresses of data blocks within 
the indexed data file and has associated logic for 
comparing an input data key with data block identifiers 
in the index to identify a match and to obtain the 
address of an identified data block within the indexed 
data file; and 

wherein the filing system includes logic for: 

(a) identifying indexed data files which exceed a preset 
maximum file size; and 

(b) replacing said excessive-sized indexed data files 
with a plurality of indexed data files, including 
updating the directory to include separate directory 
entries each containing a pointer to one of said 
replacement indexed data files. 
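
Steps (a) and (b) of claim 6 can be illustrated with a small sketch. It assumes, purely for the example, that keys are 8-bit integers, that the directory is a dictionary from bit-string prefixes to files, that a file is a dictionary of records, and that file size is measured as a record count; none of these details come from the claim.

```python
MAX_RECORDS = 4  # assumed preset maximum file size, counted in records
KEY_BITS = 8     # assumed fixed key width


def bits_of(key: int) -> str:
    return format(key, f"0{KEY_BITS}b")


def find_entry(directory, key: int) -> str:
    """Return the directory prefix whose file is responsible for this key."""
    b = bits_of(key)
    for prefix in directory:
        if b.startswith(prefix):
            return prefix
    raise KeyError(key)


def insert(directory, key: int, record):
    directory[find_entry(directory, key)][key] = record


def split_oversized(directory, max_records=MAX_RECORDS):
    # (a) identify files which exceed the preset maximum size
    oversized = [p for p, f in directory.items() if len(f) > max_records]
    for prefix in oversized:
        # (b) replace each with two files and two separate directory entries
        old = directory.pop(prefix)
        directory[prefix + "0"] = {}
        directory[prefix + "1"] = {}
        for key, record in old.items():
            insert(directory, key, record)
```

A single pass may leave a replacement file still over the limit when the keys are skewed; in this sketch the next call to `split_oversized` would split it again. Only the oversized file is redistributed, so the rest of the data is never re-hashed.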

7. A filing system according to claim 6, wherein said logic 
for replacing said excessive-sized indexed data files is 
adapted to determine when an array of keys mapped to a 
replacement indexed data file includes no more than a 
threshold number of keys and, in response to said 
determination, to build the index for said replacement 
indexed data file such that each key within said array of keys 
corresponds to a separate data block within said file. 
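
The threshold test of claim 7 can be sketched as a choice between two ways of building an index. The function name `build_index`, the threshold value, the slot count and the CRC-32 transform are assumptions for illustration only.

```python
import zlib

THRESHOLD = 4  # assumed number of entries a reasonable index can hold


def build_index(keys, threshold=THRESHOLD, n_slots=4):
    """Return (mode, index) for the keys mapped to one replacement file."""
    keys = sorted(set(keys))
    if len(keys) <= threshold:
        # One-to-one index: every key gets its own data block address,
        # so no collision handling is needed on lookup.
        return "direct", {key: addr for addr, key in enumerate(keys)}
    # Otherwise fall back to a hashed index in which keys may share a block.
    index = {}
    for key in keys:
        ident = zlib.crc32(key) % n_slots
        index.setdefault(ident, len(index))  # allocate a block on first use
    return "hashed", index
```

With three keys and a threshold of four the sketch returns a direct index with three distinct block addresses; with ten keys it returns a hashed index with at most `n_slots` blocks.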

8. A method of accessing data records in a computer file 
system using an access mechanism, the method comprising 
the steps of: 

providing an extensible hierarchical directory having a 
plurality of directory entries at least some of which 
provide pointers to respective indexed data files; 

limiting the respective indexed data files to a predefined 
maximum size which are extensible up to said pre- 
defined maximum; 

analysing a data key input to said file system by compar- 
ing the data key with entries in the extensible hierar- 
chical directory in order to identify one of said plurality 
of directory entries which provide a pointer to a respec- 
tive indexed data file; 

storing data in a plurality of indexed data files; 

providing a plurality of indices each corresponding to one 
of said indexed data files; 

locating each of said indexed data files via one of said 
pointers, the index for each of said indexed data files 
contains identifiers and storage addresses of data blocks 
within the indexed data file; 

transforming the input data key to generate a transformed 
key; 

comparing the transformed key with data block identifiers 
in the index to identify a match and to obtain the 
address of an identified data block within each of said 
indexed data files; 

identifying from records within an identified data block 
one or more data records related to the input data key; 

analysing an input data key, in response to receipt of the 
input data key, by comparison with entries in the 
directory to identify a directory entry providing a 
pointer to a respective indexed data file of the indexed 
data files; 

accessing the index for said respective indexed data file; 

transforming the input data key to generate the trans- 
formed key which is compared with data block iden- 
tifiers in the index of the indexed data file; 

comparing the transformed key with data block identifiers 
in said index to identify a match; 

obtaining from said index the address of the identified 
data block; 

identifying from records within the identified data block 
one or more records related to the input data key. 

9. A method of reorganising a database including reorga- 
nising an access mechanism comprising the steps of: 

providing an extensible hierarchical directory comprising 
a plurality of branch nodes pointing to other nodes to 
define paths through the directory from a root node to a 
plurality of leaf nodes each forming the lowest level of 
the hierarchy for a respective path, each leaf node 
providing a pointer to an indexed data file; 

analysing a data key input to said file system by compar- 
ing the data key with entries in the extensible hierar- 
chical directory in order to identify one of said plurality 
of directory entries which provide a pointer to a respec- 
tive indexed data file; 

storing data in a plurality of indexed data files; 

providing a plurality of indices each corresponding to one 
of said indexed data files; 

locating each of said indexed data files via one of said 
pointers, the index for each of said indexed data files 
contains identifiers and storage addresses of data blocks 
within the indexed data file; 

transforming the input data key to generate a transformed 
key; 

comparing the transformed key with data block identifiers 
in the index to identify a match and to obtain the 
address of an identified data block within each of said 
indexed data files; 

identifying from records within an identified data block 
one or more data records related to the input data key; 
and 

analysing an input data key by comparison with the 
extensible hierarchical directory by adapting to iteratively 
analyse the input data key by comparison of the 
key with directory entries at each successive node of a 
path through the directory to select a branch to a next 
node leading to a respective leaf node, each iteration of 
the analysis at each successive node using one or more 
additional bits of information from the input key to 
distinguish between keys and select a branch; 

periodically scanning said indexed data files to determine 
the size of each indexed data file; 

comparing the size of each indexed data file with a preset 
file size limit to identify indexed data files larger than 
said preset size limit; and 

for indexed data files larger than said preset size limit, 
replacing the indexed data file with two or more 
separate indexed data files and extending the hierarchical 
directory by addition of new directory nodes 
branching from the node of the directory which previously 
identified the pointer to the first indexed data file, 
the new nodes forming a new lowest level of the 
directory hierarchy for the respective branch, the 
extended directory then using one or more additional 
bits of information from input data keys to distinguish 
between the keys and select a branch to identify a 
respective pointer to one of said two or more replacement 
indexed data files. 
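
The reorganisation of claim 9 can be sketched with a binary radix tree that is extended only beneath leaves whose files have grown too large. The sketch assumes 8-bit integer keys, one key bit per directory level, files held as dictionaries, and a size limit counted in records; the class and function names are hypothetical.

```python
KEY_BITS = 8    # assumed fixed key width
SIZE_LIMIT = 2  # assumed preset file size limit, counted in records


class Leaf:
    """Lowest-level directory node: points to one indexed data file."""

    def __init__(self):
        self.file = {}


class Branch:
    """Directory node pointing to two further nodes, one per key bit value."""

    def __init__(self):
        self.children = [Leaf(), Leaf()]


def bit(key: int, depth: int) -> int:
    # The bit of the key examined at the given directory depth (MSB first).
    return (key >> (KEY_BITS - 1 - depth)) & 1


def find_leaf(root, key: int) -> Leaf:
    node, depth = root, 0
    while isinstance(node, Branch):
        node = node.children[bit(key, depth)]  # one more bit per level
        depth += 1
    return node


def reorganise(node, depth=0, limit=SIZE_LIMIT):
    """Scan the directory and return the node, split where files are too big."""
    if isinstance(node, Branch):
        node.children = [reorganise(c, depth + 1, limit) for c in node.children]
        return node
    if len(node.file) <= limit or depth >= KEY_BITS:
        return node  # within the limit, or no key bits left to split on
    # Replace the oversized leaf with a branch and two new leaves.
    branch = Branch()
    for key, record in node.file.items():
        branch.children[bit(key, depth)].file[key] = record
    return reorganise(branch, depth, limit)  # a new leaf may still be too big
```

Starting from a single leaf holding keys 0, 64, 65, 66 and 128, `reorganise` leaves the branch for keys beginning with bit 1 as a leaf directly under the root, while the branch for keys beginning with bit 0 is extended by further levels until no file exceeds the limit.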

10. A method according to claim 9, wherein said step of 
replacing an indexed data file which is larger than the preset 
size limit includes the steps of: 

determining when an array of keys mapped to a replacement 
indexed data file includes no more than a threshold 
number of keys and, in response to said 
determination, building the index for said replacement 
indexed data file such that each key within said array of 
keys corresponds to a separate data block within said 
file. 

11. A computer program product comprising computer 
program code recorded on a machine readable recording 
medium, the program code including instructions for 
implementing the steps of: 

providing an extensible hierarchical directory having a 
plurality of directory entries at least some of which 
provide pointers to respective indexed data files; 

limiting the respective indexed data files to a predefined 
maximum size which are extensible up to said pre- 
defined maximum; 

analysing a data key input to said file system by comparing 
the data key with entries in the directory in order to 
identify one of said plurality of directory entries which 
provide a pointer to a respective indexed data file; 

storing data in a plurality of indexed data files of the 
indexed data files; 

providing a plurality of indices each corresponding to one 
of said indexed data files; 

locating each of said indexed data files via one of said 
pointers, the index for each indexed data file contains 
identifiers and storage addresses of data blocks within 
the indexed data file; 

transforming the input data key to generate a transformed 
key and logic for comparing a transformed key with 
data block identifiers in the index to identify a match 
and to obtain the address of an identified data block 
within the indexed data file; 

identifying from records within an identified data block 
one or more data records related to the input data key; 

analysing an input data key, in response to receipt of the 
input data key, by comparison with entries in the 
directory to identify a directory entry providing a 
pointer to a respective indexed data file of the indexed 
data files; 

accessing the index for said respective indexed data file; 

transforming the input data key to generate a transformed 
key which is compared with data block identifiers in the 
index of the indexed data file; 

comparing the transformed key with data block identifiers 
in said index to identify a match; 

obtaining from said index the address of the identified 
data block; 

identifying from records within the identified data block 
one or more records related to the input data key. 

12. The computer program of claim 11, including the 
further steps of: 

identifying indexed data files exceeding the predefined 
maximum size; 

responding to a first indexed data file exceeding the 
predefined maximum size by replacing the first indexed 
data file with two or more separate indexed data files 
and by extending the hierarchical directory by addition 
of new directory nodes branching from the node of the 
directory which previously identified the pointer to the 
first indexed data file; and 

forming a new lowest level of the directory hierarchy 
for the respective branch with the new nodes, such that 
the extended directory is adapted to use one or more 
additional bits of information from input data keys to 
distinguish between the keys and select a branch to 
identify a respective pointer to one of said two or more 
replacement indexed data files. 

***** 